This material has been adapted from ALSF CCDL training materials.
PLIER method for unsupervised
machine learning for human transcriptomics dataComplexHeatmap packageggplot2 plots (but there
will be more on ggplot2 in a later lesson).As we’ve seen in the course so far, we can explore data with unsupervised machine learning approaches like clustering or PCA. Often, these methods can work with any generic dataset. In this notebook, we’ll introduce a machine learning technique that is specifically for gene expression data.
The dataset we’re using comes from the OpenPBTA project. We’ll be using medulloblastoma data only.
# Bit o' data wranglin' expected
library(tidyverse)
── Attaching core tidyverse packages ───────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ─────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Heatmap
library(ComplexHeatmap)
Loading required package: grid
========================================
ComplexHeatmap version 2.24.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
If you use it in published research, please cite either one:
- Gu, Z. Complex Heatmap Visualization. iMeta 2022.
- Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
genomic data. Bioinformatics 2016.
The new InteractiveComplexHeatmap package can directly export static
complex heatmaps into an interactive Shiny app with zero effort. Have a try!
This message can be suppressed by:
suppressPackageStartupMessages(library(ComplexHeatmap))
========================================
Make sure PLIER is installed.
if (!("remotes" %in% installed.packages())) {
install.packages("remotes")
}
if (!("PLIER" %in% installed.packages())) {
# Install PLIER from GitHub
remotes::install_github("wgmao/PLIER@v0.1.6")
}
# Load Pathway-Level Information ExtractoR
library(PLIER)
Loading required package: RColorBrewer
Loading required package: gplots
Attaching package: 'gplots'
The following object is masked from 'package:stats':
lowess
Loading required package: pheatmap
Attaching package: 'pheatmap'
The following object is masked from 'package:ComplexHeatmap':
pheatmap
Loading required package: glmnet
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Loaded glmnet 4.1-9
Loading required package: knitr
Loading required package: rsvd
Loading required package: qvalue
In this notebook, we’ll use a method called Pathway-Level Information Extractor (PLIER) (Mao et al. (2019)).
We like PLIER for a few reasons:
PLIER is similar to other pathway analysis methods that you may be familiar with in that it uses prior knowledge in the form of gene sets. It produces output values that are on an individual sample level and does not require a two group comparison ahead of time like some pathway analysis methods. However, PLIER is designed to align the LVs it constructs with the relevant input gene sets that the data supports, whereas other methods will use all gene sets you provide as input.
Here’s an overview of the PLIER method from Mao et al. (2019) (Figure 1).
Fig. 1 | PLIER overview. PLIER is a matrix factorization approach that decomposes gene expression data into a product of a small number of LVs and their corresponding gene associations or loadings, while constraining the loadings to align with the most relevant automatically selected subset of prior knowledge. a, Given two inputs, the gene expression matrix Y and the prior knowledge (represented as binary gene set membership in matrix C), the method returns the LVs (B), their loadings (Z), and an additional sparse matrix (U) that specifies which (if any) prior-information gene sets and pathways are used for each LV. The light gray area of U indicates the large number of zero elements of the matrix. We apply our method to a whole-blood human gene expression dataset. b, The positive entries of the resulting U matrix are visualized as a heat map, facilitating the identification of the correspondence between specific LVs and prior biological knowledge. As the absolute scale of the U matrix is arbitrary, each column is normalized to a maximum of 1. c, We validate the LVs mapped to specific leukocyte cell types by comparing PLIER estimated relative cell-type proportions with direct measurements by mass cytometry. Dashed lines represent 0.05, 0.01, and 0.001 significance levels for Spearman rank correlation (one-tailed test). NK cell, natural killer cell.
We’ve prepared the model ahead of time to save time during the course. You can see what steps we took to complete model training here and, more generally, how we setup the module here.
# The file containing the PLIER::PLIER() output is saved in the results
# directory
plier_file <- file.path(
"results",
"plier",
"medulloblastoma_plier_model.rds"
)
# Read in the RDS file that contains the PLIER::PLIER() output
plier_results <- read_rds(plier_file)
What does the output of model training look like?
View(plier_results)
The U matrix tells us about how the latent variables learned
by the model relate to the pathways we used as input.
plotU() is a special function to display the U
matrix.
PLIER::plotU(plier_results,
fontsize_row = 6
)
summary() of a PLIER results object returns
the FDR and AUC values for input pathway to latent variable
relationships.
plier_results$summary %>%
filter(FDR < 0.05) %>%
arrange(FDR)
The B matrix contains the latent variable values for each sample.
dim(plier_results$B)
[1] 58 121
Let’s take a peek at the matrix itself.
plier_results$B[1:5, 1:5]
BS_09Z7TC35 BS_1AYRM596 BS_1BWP5MCT BS_1QXEC43H
LV 1 -0.03069958 -0.1704239 0.2077939 0.04329289
2,REACTOME_GPCR_LIGAND_BINDING -0.34255613 -0.3082474 0.8230139 0.29574368
LV 3 -0.47525627 -0.2621213 -0.3022120 0.24102044
LV 4 -0.21868401 0.1518692 -0.6468820 -0.38166945
5,REACTOME_NEURONAL_SYSTEM 0.26218943 0.5620391 -0.7948679 -0.94005443
BS_1TWCV047
LV 1 0.40554013
2,REACTOME_GPCR_LIGAND_BINDING -0.51289392
LV 3 0.04078855
LV 4 -0.25064184
5,REACTOME_NEURONAL_SYSTEM 0.23807593
The Z matrix contains the gene loadings (how genes combine to get B).
dim(plier_results$Z)
[1] 6678 58
We can use Z to tell us which genes contribute to individual LVs by accessing the column corresponding to that LV. We’ll use 20 below, but you can change the number to suit your purposes!
head(
sort(plier_results$Z[, 20],
decreasing = TRUE
),
n = 25
)
S100A12 PADI4 AQP9 CXCR1 APOBEC3A TREML2 HAL NFE2
1.1461282 1.0785668 1.0682841 1.0636914 1.0423266 1.0133903 1.0095173 1.0077395
FFAR2 FPR2 CSF3R TREM1 PGLYRP1 FPR1 MEFV FCGR3B
1.0047816 0.9932291 0.9842613 0.9670226 0.9420513 0.9385210 0.9363859 0.9284551
S100A9 NCF2 S100A8 KRT23 CXCR2 VNN2 BTNL8 FCN1
0.9197182 0.9129084 0.9126968 0.9096359 0.9041732 0.8730759 0.8699117 0.8675057
CEACAM3
0.8498744
For biological discovery, we are often most interested in the latent variables that have some kind of association with an input gene set or pathway. We can use the FDR values in the summary data frame to filter to only the latent variables with a significant association (and their associated gene sets).
# Filter to LV-pathway relationships with FDR < 0.05
sig_summary_df <- plier_results$summary %>%
dplyr::filter(FDR < 0.05)
sig_summary_df
# We only want a single instance of each LV index
sig_index <- as.integer(unique(sig_summary_df$`LV index`))
# Get the LV by sample matrix from the PLIER results and subset it to only those
# LVs with an FDR < 0.05 (at least one pathway)
b_matrix <- plier_results$B
sig_b_matrix <- b_matrix[sig_index, ]
Let’s make a heatmap of the latent variable values for the variables that are significantly associated with an input pathway.
We can make one pretty easily using the ComplexHeatmap
package.
Heatmap(sig_b_matrix)
I’m not sure that’s so useful on its own, so let’s make some improvements!
Medulloblastoma has molecular subtypes, and we have molecular subtype labels for these samples. We can use this information to annotate our heatmap, but first, we need to read it in!
# Read in metadata
histologies_df <- read_tsv(
file.path(
"data",
"metadata",
"pbta-histologies-medulloblastoma-rnaseq.tsv"
)
)
Rows: 121 Columns: 3
── Column specification ─────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): Kids_First_Biospecimen_ID, short_histology, molecular_subtype
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Create a data frame that only has the biospecimen identifiers and the
# molecular subtype labels
subtype_df <- histologies_df %>%
select(
Kids_First_Biospecimen_ID,
molecular_subtype
)
Heatmap annotations require the sample identifiers to be the rownames, so let’s set that up.
annotation_df <- subtype_df |>
tibble::column_to_rownames("Kids_First_Biospecimen_ID") |>
as.data.frame()
Now we’re ready to make a heatmap annotation using a palette that is color vision deficiency friendly.
# Get a vector of hex codes for the Okabe-Ito palette
okabe_ito_palette <- unname(palette.colors(palette = "Okabe-Ito"))
# Create a sample HeatmapAnnotation
sample_annotation <- HeatmapAnnotation(
# Sample to molecular subtype mapping
df = annotation_df,
# Colors for the annotation
col = list(molecular_subtype = c(
"MB, Group3" = okabe_ito_palette[1],
"MB, Group4" = okabe_ito_palette[2],
"MB, SHH" = okabe_ito_palette[3],
"MB, To be classified" = okabe_ito_palette[4],
"MB, WNT" = okabe_ito_palette[5]
)),
# Make the label for the annotation look a bit nicer than the column name
# would
annotation_label = "Molecular Subtype"
)
Check that the order of samples is the same in the annotation and the matrix being annotated.
identical(rownames(annotation_df), colnames(sig_b_matrix))
[1] TRUE
Now we can make a heatmap object with some adjustments explained in the inline comments. We’ll plot it in the next chunk where we adjust the legend position.
heatmap_object <- Heatmap(sig_b_matrix,
# Add molecular subtype annotation
top_annotation = sample_annotation,
# The sample names were hard to read
show_column_names = FALSE,
# Make the row names a bit smaller
row_names_gp = gpar(fontsize = 6),
# Let's add some space between the cells
rect_gp = gpar(col = "white", lwd = 0.25),
# Make the heatmap legend horizontal instead of
# Vertical
heatmap_legend_param = list(direction = "horizontal")
)
To adjust the legend positions, we can use
heatmap_legend_side and annotation_legend_side
with draw().
draw(heatmap_object,
# Put heatmap legend below the heatmap
heatmap_legend_side = "bottom",
# Put the annotation legend below the heatmap
annotation_legend_side = "bottom"
)
Let’s save this heatmap as a PNG.
It can be helpful to keep all the plots organized in the same folder, so let’s set that up first.
# Call the folder plots
plots_dir <- file.path("plots")
# Create it if it doesn't exist yet
dir.create(plots_dir, showWarnings = FALSE, recursive = TRUE)
Now, save as a PNG in the plots directory.
heatmap_file <- file.path(
plots_dir,
"medulloblastoma_significant_lvs_heatmap.png"
)
# Plotting device
png(heatmap_file, width = 8, height = 5, units = "in", res = 300)
# Draw the heatmap and legends again
draw(heatmap_object,
# Put heatmap legend below the heatmap
heatmap_legend_side = "bottom",
# Put the annotation legend below the heatmap
annotation_legend_side = "bottom"
)
# Shut down device
dev.off()
png
2
In order to use ggplot2, we’ll need the data in “long”
or “tidy” format. PLIER outputs what we want to plot in
what we call “wide” format.
Read more about tidy data here.
To quote from Hadley Wickham’s R for Data Science:
There are three interrelated rules which make a dataset tidy:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
Let’s look at a toy example.
set.seed(12345)
toy_df <- data.frame(
cbind(
c("GENEA", "GENEB", "GENEC"),
matrix(rnorm(30), ncol = 10)
)
)
colnames(toy_df) <- c("Gene", paste0("Sample", 1:10))
toy_df is now in “wide” format.
toy_df
Let’s get it into “long” format.
toy_long_df <- tidyr::pivot_longer(toy_df,
# The data is in every column except the one
# named "Gene"
cols = -Gene,
# What will we call the column of the old df
# column names?
names_to = "Sample",
# What will we call the column of values
# from the old df?
values_to = "Expression"
)
toy_long_df
Let’s remove these toy examples from the workspace.
rm(toy_df, toy_long_df)
Let’s look at what format the LV values are in currently.
# First, create a data frame of and add a column with LV identifiers
sig_b_wide <- data.frame(sig_b_matrix) %>%
tibble::rownames_to_column(var = "LV")
sig_b_wide
We want this in long format for plotting. We’ll use
tidyr::pivot_longer to do this just like in the toy example
above.
sig_b_df <- tidyr::pivot_longer(sig_b_wide,
cols = starts_with("BS_"),
names_to = "Kids_First_Biospecimen_ID",
values_to = "LV_estimate"
)
head(sig_b_df)
Right now the LV column has values that contain two
pieces of information: the LV index and the pathway that the LV has been
named for.
Remember, just because a LV is named for a single pathway, that
doesn’t mean that that is the only input pathway that is significantly
associated with that latent variable - always check
summary!
Now let’s add relevant metadata to the data frame so we can use that for plotting.
# Add the subtype labels to the LV estimates
sig_b_df <- inner_join(
x = sig_b_df,
y = subtype_df,
by = "Kids_First_Biospecimen_ID"
)
We’ll plot LV20; this is the latent variable that we looked at the loadings for in an earlier chunk. You can try using a different LV if you would like!
# PLIER names certain latent variables based on their association with input
# gene sets
lv_to_plot <- rownames(plier_results$B)[20]
# For plotting, subset only to the rows corresponding to this latent variable
lv_plot_df <- sig_b_df %>%
filter(LV == lv_to_plot)
Let’s start by making a simple boxplot.
# Make a boxplot where samples are grouped by molecular subtype
ggplot(
lv_plot_df,
aes(
x = molecular_subtype,
y = LV_estimate,
group = molecular_subtype,
color = molecular_subtype
)
) +
geom_boxplot()
It can often be helpful to visualize individual samples.
# Add individual points with geom_jitter()
ggplot(
lv_plot_df,
aes(
x = molecular_subtype,
y = LV_estimate,
group = molecular_subtype,
color = molecular_subtype
)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter()
We’re able to globally adjust the aesthetics of the jitter points.
# Improve the aesthetics of the points
ggplot(
lv_plot_df,
aes(
x = molecular_subtype,
y = LV_estimate,
group = molecular_subtype,
color = molecular_subtype
)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5)
Add a built-in ggplot2 theme.
# Use @jaclyn-taroni's favorite theme :)
ggplot(
lv_plot_df,
aes(
x = molecular_subtype,
y = LV_estimate,
group = molecular_subtype,
color = molecular_subtype
)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_bw()
Use a different color palette.
# Add "Okabe-Ito" color scheme
ggplot(
lv_plot_df,
aes(
x = molecular_subtype,
y = LV_estimate,
group = molecular_subtype,
color = molecular_subtype
)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_bw() +
scale_color_manual(values = unname(palette.colors(palette = "Okabe-Ito")))
Add a title to the plot.
# Use labs() to add a title
ggplot(
lv_plot_df,
aes(
x = molecular_subtype,
y = LV_estimate,
group = molecular_subtype,
color = molecular_subtype
)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_bw() +
scale_color_manual(values = unname(palette.colors(palette = "Okabe-Ito"))) +
labs(title = lv_to_plot)
Center the title and make it bigger and in bold.
# Use theme() to improve the way the title looks
ggplot(
lv_plot_df,
aes(
x = molecular_subtype,
y = LV_estimate,
group = molecular_subtype,
color = molecular_subtype
)
) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5) +
theme_bw() +
scale_color_manual(values = unname(palette.colors(palette = "Okabe-Ito"))) +
labs(title = lv_to_plot) +
theme(plot.title = element_text(
size = 15,
face = "bold",
hjust = 0.5
))
Use the next chunks to further customize your plot. We might suggest starting with the x- and y-axis labels.
?labs
Try a new customization!
Save the last plot to a PNG file. ggplot2 has a function
named ggsave() that we can use to do that.
boxplot_file <- file.path(
plots_dir,
"medulloblastoma_plier_lv20_boxplot.png"
)
ggsave(boxplot_file, plot = last_plot())
Saving 7 x 5 in image
sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] PLIER_0.99.0 qvalue_2.40.0 rsvd_1.0.5
[4] knitr_1.50 glmnet_4.1-9 Matrix_1.7-3
[7] pheatmap_1.0.13 gplots_3.2.0 RColorBrewer_1.1-3
[10] ComplexHeatmap_2.24.0 lubridate_1.9.4 forcats_1.0.0
[13] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4
[16] readr_2.1.5 tidyr_1.3.1 tibble_3.3.0
[19] ggplot2_3.5.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 farver_2.1.2 bitops_1.0-9
[4] fastmap_1.2.0 digest_0.6.37 timechange_0.3.0
[7] lifecycle_1.0.4 cluster_2.1.8.1 survival_3.8-3
[10] magrittr_2.0.3 compiler_4.5.0 rlang_1.1.6
[13] sass_0.4.10 tools_4.5.0 yaml_2.3.10
[16] labeling_0.4.3 bit_4.6.0 plyr_1.8.9
[19] KernSmooth_2.23-26 withr_3.0.2 BiocGenerics_0.54.0
[22] stats4_4.5.0 caTools_1.18.3 colorspace_2.1-1
[25] scales_1.4.0 gtools_3.9.5 iterators_1.0.14
[28] cli_3.6.5 rmarkdown_2.29 crayon_1.5.3
[31] ragg_1.4.0 generics_0.1.4 rstudioapi_0.17.1
[34] reshape2_1.4.4 tzdb_0.5.0 rjson_0.2.23
[37] cachem_1.1.0 splines_4.5.0 parallel_4.5.0
[40] matrixStats_1.5.0 vctrs_0.6.5 jsonlite_2.0.0
[43] IRanges_2.42.0 hms_1.1.3 GetoptLong_1.0.5
[46] S4Vectors_0.46.0 bit64_4.6.0-1 clue_0.3-66
[49] systemfonts_1.2.3 foreach_1.5.2 jquerylib_0.1.4
[52] glue_1.8.0 codetools_0.2-20 stringi_1.8.7
[55] shape_1.4.6.1 gtable_0.3.6 pillar_1.10.2
[58] htmltools_0.5.8.1 circlize_0.4.16 R6_2.6.1
[61] textshaping_1.0.1 doParallel_1.0.17 vroom_1.6.5
[64] evaluate_1.0.3 lattice_0.22-7 png_0.1-8
[67] bslib_0.9.0 Rcpp_1.0.14 xfun_0.52
[70] pkgconfig_2.0.3 GlobalOptions_0.1.2